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Abstract 

We statistically characterize the population spiking activity obtained from simultaneous recordings of neurons across all 
layers of a cortical microcolumn. Three types of models are compared: an Ising model which captures pairwise correlations 
between units, a Restricted Boltzmann Machine (RBM) which allows for modeling of higher-order correlations, and a semi- 
Restricted Boltzmann Machine which is a combination of Ising and RBM models. Model parameters were estimated in a fast 
and efficient manner using minimum probability flow, and log likelihoods were compared using annealed importance 
sampling. The higher-order models reveal localized activity patterns which reflect the laminar organization of neurons 
within a cortical column. The higher-order models also outperformed the Ising model in log-likelihood: On populations of 
20 cells, the RBM had 10% higher log-likelihood (relative to an independent model) than a pairwise model, increasing to 
45% gain in a larger network with 100 spatiotemporal elements, consisting of 10 neurons over 10 time steps. We further 
removed the need to model stimulus-induced correlations by incorporating a peri-stimulus time histogram term, in which 
case the higher order models continued to perform best. These results demonstrate the importance of higher-order 
interactions to describe the structure of correlated activity in cortical networks. Boltzmann Machines with hidden units 
provide a succinct and effective way to capture these dependencies without increasing the difficulty of model estimation 
and evaluation. 
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Introduction 

Electrophysiology is rapidly moving towards high density 
recording techniques capable of capturing the simultaneous activity 
of large populations of neurons. This raises the challenge of 
understanding how networks encode and process information in 
ways that go beyond tuning properties or feedforward receptive field 
models. Modeling the distribution of states in a network provides a 
way to discover communication patterns between neurons or 
functional groupings such as cell assemblies which may exhibit a 
more direct relation to stimulus or behavioral variables. 

The Ising model, originally developed in the 1920s to describe 
magnetic interactions [1], has been used to statistically characterize 
electrophysiological data, particularly in the retina [2], and more 
recently for cortical recordings [3,4] . This model treats spikes from a 
population of neurons binned in time as binary vectors and captures 
dependencies between cells with the maximum entropy distribution 
for pairwise dependencies. This has been shown to provide a good 
model for small groups of cells in the retina [5] , though it is unable 
to capture dependencies higher than second-order. 

In this work, we apply maximum entropy models to neural 
population recordings from the visual cortex. Cortical networks 
have proven more challenging to model than the retina: The 
magnitude and importance of pairwise correlations between 
cortical cells is controversial [6,7] and higher-order correlations. 



i.e. correlations which cannot be captured by a pair-wise 
maximum entropy model, play a more important role [8-10]. 
One of the challenges with current recording technologies is that 
we can record simultaneously only a tiny fraction of the cells that 
make up a cortical circuit. Sparse sampling together with the 
complexity of the circuit mean that the majority of a cell's input 
will be from cells outside the recorded population. In adult cat 
visual cortex, direct synaptic connections have been reported to 
occur between 1 1%-30% of nearby pairs of excitatory neurons in 
layer IV [11], while a larger fraction of cell pairs show 
"polysynaptic" couplings [12], defined by a broad peak in the 
cross-correlation between two cells. This type of coupling can be 
due to common inputs (either from a different cortical area or 
lateral connections) or a chain of monosynaptic connections. A 
combination of these is believed to give rise to most of the 
statistical interactions between recorded pairs of cells. The Ising 
model, which assumes only pairwise couplings, is well suited to 
model direct (and symmetric) synaptic coupling, but cannot 
capture interactions involving more than two cells. We propose a 
new approach, that addresses both incomplete sampling and 
common inputs from other cell assemblies, by extending the Ising 
model with a layer of hidden units or latent variables. The 
resulting model is a semi-Restricted Boltzmann Machine (sRBM), 
which combines pairwise connections between visible units with an 
additional set of connections to hidden units. 
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Author Summary 

Communication between neurons underlies all perception 
and cognition. Hence, to understand how the brain's 
sensory systems such as the visual cortex work, we need to 
model how neurons encode and communicate informa- 
tion about the world. To this end, we simultaneously 
recorded the activity of many neurons in a cortical column, 
a fundamental building block of information processing in 
the brain. This allows us to discover statistical structure in 
their activity, a first step to uncovering communication 
pathways and coding principles. To capture the statistical 
structure of firing patterns, we fit models that assign a 
probability to each observed pattern. Fitting probability 
distributions is generally difficult because the model 
probabilities of all possible states have to sum to one, 
and enumerating all possible states in a large system is not 
possible. Making use of recent advances in parameter 
estimation, we are able to fit models and test the quality of 
the fit to the data. The resulting model parameters can be 
interpreted as the effective connectivity between groups 
of cells, thus revealing patterns of interaction between 
neurons in a cortical circuit. 



Estimating the parameters of energy-based models, to which 
Ising models and Boltzmann machines belong, is computationally 
hard because these models cannot be normalized in closed form. 
For both Ising models and Boltzmann machines with hidden units, 
the normalization constant is intractable to compute, consisting of 
a sum over the exponential number of states of the system. This 
makes exact maximum likelihood estimation impossible for all but 
the smallest systems and necessitates approximate or computa- 
tionally expensive estimation methods. In this work, we use 
Minimum Probability Flow (MPF [13,14], in the context of neural 
decoding see [4,15]) to estimate parameters efficiently without 
computing the intractable partition function. It provides a 
straightforward way to estimate the parameters of Ising models 
and Boltzmann machines for high-dimensional data. 

Another challenge in using energy-based models is the 
evaluation of their likelihood after fitting to the data, which is 
again made difficult due to the partition function. To compute 
probabilities and compare the likelihood of different models, 
annealed importance sampling (AIS) [16] was used to estimate the 
partition function. 

Combining these two methods for model estimation and 
evaluation, we show that with hidden units, Boltzmann machines 
can capture the distribution of states in a microcolumn of cat 
visual cortex significantly better than an Ising model without 
hidden units. The higher-order structure discovered by the model 
is spatially organized and specific to cortical layers, indicating that 
common input or recurrent connectivity within individual layers of 
a microcolumn are the dominant source of correlations. Applied to 
spatiotemporal patterns of activity, the model captures temporal 
structure in addition to dependencies across different cells, 
allowing us to predict spiking activity based on the history of the 
network. 

Results 

Modeling lanninar population recordings 

We estimated Ising, RBM and sRBM models for populations of 
cortical cells simultaneously recorded across all cortical layers in a 
microcolumn of cat VI in response to long, continuous natural 
movies presented at a frame rate of 150 Hz. Code for the model 



estimation is available for download at http://github.com/ursk/ 
srbm. Fig. la) shows an example frame from one of the movies. 
Model parameters were estimated using MPF with an Li 
regularization penalty on the model parameters to prevent 
overfitting. To compute and compare likelihoods, the models 
were normalized using AIS. Here we present data from two 
animals, one with 22 single units (B4), another with 36 units (T6), 
as well as a multiunit recording with 26 units (B4M). Fig. lb) shows 
spiking data from session B4 in 20 ms bins, with black squares 
indicating a spike in a bin. Spatiotemporal datasets were 
constructed by concatenating spikes from consecutive time bins. 
Pairs of cells show weak positive correlations, shown in Fig. Ic), 
and noise correlations computed from 60 repetitions of a 30s 
stimulus are similarly small and positive. For all recordings, the 
population was verified to be visually responsive and the majority 
of cells were orientation selective simple or complex cells. As 
recordings were performed from a single cortical column, 
receptive fields shared the same retinotopic location and have 
similar orientation selectivity, differing mostly in size, spatial 
frequency and phase selectivity. See [17] for a receptive field 
analysis performed on the same raw data. 

Pairwise and higher-order nnodels 

The estimated model parameters for the three different types of 
models (Ising, RBM and sRBM) are shown in Fig. 2 for session B4. 
The Li sparseness penalty, chosen to optimize likelihood on a 
validation dataset, results in many of the parameters being zero. 
For the Ising model (a) we show the coupling as a matrix plot, with 
lines indicating anatomical layer boundaries. The diagonal 
contains the bias terms, which are negative since all cells are off 
the majority of the time. The matrix has many small positive 
weights that encourage positive pairwise correlations. 

In (b) we show the hidden units of the RBM as individual bar 
plots, with the bars representing connection strengths to visible 
units. The topmost bar corresponds to the hidden bias of the unit, 
and hidden units are ordered from highest to lowest variance. The 
units are highly selective in connectivity: The first unit almost 
exclusively connects to cells in the deep (granular and subgranular) 
cortical layers. The second unit captures correlations between cells 
in the superficial (supergranular) layers. The correlations are of 
high order, with 10 and more cells receiving input from a hidden 
unit. The remaining units connect fewer cells, but still tend to be 
location-specific. Only the hidden units that have non-zero 
couplings are shown. Additional hidden units are turned off by 
the Li sparseness penalty, which was chosen to maximize 
likelihood on the cross-validation dataset. The interpretation of 
hidden units is quite similar to the pairwise terms of the Ising 
model: positive coupling to a group of visible units encourages 
these units to become active simultaneously, as the energy of the 
system is lowered if both the hidden unit and the cells it connects 
to are active. Thus the hidden units become switched on when 
cells they connect to are firing (activation of hidden units not 
shown). 

The sRBM combines both pairwise and hidden connections 
and hence is visualized with a pairwise coupling matrix and bar 
plots for hidden units. With the larger number of parameters, the 
best model is even more sparse in the number of nonzero 
parameters. The remaining pairwise terms predominantly encode 
negative interactions, and much of the positive coupling has been 
explained away by the hidden units. These give rise to strong 
positive couplings within either superficial (II /III) or intermediate 
(IV) and deep (V/VI) layers, which explain the majority of 
structure in the data. The more succinct explanation for 
dependencies between recorded neurons is via connections to 



PLOS Computational Biology | www.ploscompbiol.org 



2 



July 2014 I Volume 10 | Issue 7 | e1003684 



Higher-Order Correlations in Cortical Microcolumns 



a) Stimulus frames 



c) Pairwise Correlations, 20ms bins 





0 0.1 0 0.1 0 0.05 
Raw Correlation Noise Shuffled 



b) Example data, 2s of data in 20ms bins 




Figure 1. Laminar population recordings in response to natural movies, (a) Example of one of the natural movie stimuli, depicting ducks on 
a lawn, presented full-field at 150 frames per second, (b) Example data from 22 cells (session B4), binned in 20 ms windows, 2 s of data. Columns of 
this matrix form the training data for our algorithm. For the spatiotemporal version of the model, several adjacent columns are concatenated, (c) 
Pairwise correlations in the raw data, and noise correlations computed from 60 repetitions of a 30 s stimulus, binned at 20 ms. Both show small, 
positive correlations. Shuffling spikes for each of the cells shows that correlations expected due to shared firing rate modulations time-locked to the 
stimulus are much smaller. 
doi:1 0.1 371 /journal.pcbi.1 003684.g001 



shared hidden units, rather than direct couplings between visible 
units. The RBM and sRBM in this comparison were both 
estimated with 22 hidden units, but we show only units that did 
not turn off entirely due to the sparseness penalty. In this example, 
a sparseness penalty ofl = 2x 10~^ was found to be optimal for 
all three models. 

Including stinnulus-driven connponents 

In order to ascertain to what degree the stimulus driven 
component of activity accounts for the learned higher-order 
correlations, we augmented the above models with a dynamic bias 
term that consists of the log of the average instantaneous firing 
probability of each cell over repeated presentations of the same 
stimulus. In the case that all trained parameters were zero, this 
model would assign a firing probability to all neurons identical to 
that in the peri-stimulus time histogram (PSTH). 

In Fig. 2d) the couplings for the Ising model with stimulus terms are 
shown. As the pairwise couplings now only capture additional 
structure beyond correlations explained by the stimulus, they tend to 
be weaker than in the Ising model without stimulus terms. In 
particular the bias terms on the diagonal are almost completely 
explained away by the dynamic bias. The same reasoning applies to 
the RBM with PSTH terms, which is shown in e). Although the 
couplings are weaker than for the pure RBM, the basic structure 
remains, with the first two hidden units explaining correlations within 
superficial and deep groups of cells, respectively. This shows that the 
learned coupling structure can not be explained purely from higher- 
order stimulus correlations and receptive field overlap. Even when 
stimulus-induced correlations are fully accounted for, the correlation 
structure captured by the RBM remains similar and higher-order 
correlations are the dominant driver of correlated firing. 

Model connparison 

For a quantitative comparison between models, we computed 
normalized likelihoods using Annealed Importance Sampling 



(AIS) to estimate the partition function. For each model, we 
generated 500 samples through a chain of 10^ annealing steps. To 
ensure convergence of the chain, we use a series of chains varying 
the number of annealing steps and verify that the estimate of the 
partition function Z stabilizes to within at least 0.02 bits/ s (see Fig. 
SI). For models of size 20 and smaller we furthermore computed 
the partition function exactly to verify the AIS estimate. 

Fig. 3a) shows a comparison of excess log likelihood C for the 
three different models and on all three datasets. £, which we 
define as the gain in likelihood over the independent firing rate 
model, is computed in units of bits/ spike for the full population. 
Both higher-order models outperform the Ising model in fitting the 
datasets significantly. Error bars are standard deviation computed 
from 1 0 models with different random subsets of the data used for 
learning and validation, and different random seeds for the 
parameters and the AIS sampling runs. 

Fig. 3b) shows the excess log likelihood for the models with 
stimulus terms. Due to the additional computational complexity, 
these models were only estimated for the small B4 data set. The 
left of the two bar plots shows that including the stimulus 
information through the PSTH greatly increases the likelihood, 
even the PSTH only model without coupling terms outperforms 
the Ising and RBM models by about 0.7 bits/ s. Including coupling 
terms still increases the likelihood, which is particularly visible on 
the right bar plot which shows the log likelihood gain relative to 
the PSTH model. Including higher-order coupling terms still 
provides a significant gain over the pairwise model, confirming 
that there are higher-order correlation in the data beyond those 
induced by the stimulus. 

Each of the models was estimated for a range of sparseness 
parameters A= [0,1,2,4,6,8,10] x 10~^ bracketing the optimal X 
using 4-fold cross-validation on a holdout set, and the results are 
shown for the optimal choice of X for each model. 

Additional insight into the relative performance of the models 
can be gained by comparing model probabilities to empirical 
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a) Ising model 



b) RBM hidden units 
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Figure 2. Functional connectivity patterns of the three models estimated for recording session B4. The horizontal lines indicate 
approximate boundaries between cortical layers ll/lll, layer IV and layers V/VI. (a) Ising model coupling matrix. Each row/column of the matrix 
corresponds to a neuron, bias terms are shown on the diagonal. The model has many small coupling terms that encode positive correlations, (b) The 
RBM coupling weights are shown as a bar chart for each hidden unit, ordered from left to right from largest to smallest activity. The first bar chart is 
the bias for all the visible units, and the blue bar at the top of each plot corresponds to the bias of that hidden unit. Blue bars indicate negative values 
(the bias terms are predominantly negative, but plotted with flipped sign to fit on the scale of the remaining terms), (c) The sRBM weights are shown 
in the same way, with the pairwise couplings on the left and hidden units on the right. The pairwise connections are qualitatively very different from 
those of the Ising model, as most of the structure is better captured by hidden units, (d) Pairwise coupling terms for the Ising model with stimulus 
terms. Much of the structure, in particular the bias terms, have been explained away by the stimulus terms, (e) Hidden unit couplings for the RBM 
with stimulus terms. The structure of the hidden units remains largely unchanged, indicating higher-order couplings are due to network interactions 
and not stimulus correlations. 
doi:1 0.1 371 /journal.pcbi.1 003684.g002 



probabilities for the various types of patterns. Fig. 4 shows scatter 
plots of model probabilities under the different models against 
pattern frequencies in the data. Patterns with a single active cell, 
two simultaneously active cells, etc. are distinguished by different 
symbols. As expected from the positive correlations, the indepen- 
dent model (yellow) shown in a) consistently overestimates the 
probabilities of cells being active individually, so these patterns fall 
above the identity line, while all other patterns are underestimated. 
For comparison, the Ising model is shown in the same plot (blue), 
and does significantly better, indicated by the points moving closer 
to the identity line. It still tends to fail in a similar way though, with 
many of the triplet patterns being underestimated as the model 
cannot capture triplet correlations. In b), this model is directly 
compared against the RBM (green). Except for very rare patterns, 
most points are now very close to the identity line, as the model 
can fully capture higher-order dependencies. Hidden units 
describe global dependencies that greatly increase the frequency 
of high order patterns compared to individually active cells. The 



5% and 95% confidence intervals for the counting noise expected 
in the empirical frequency of states are shown as dashed lines. The 
solid line is the identity. Inserts in both models show the 
distribution of synchrony, P(K), where K is the number of cells 
simultaneously firing in one time bin. This metric has been used 
for example in [18] to show how pairwise models fail to capture 
higher-order dependencies. In the case of the T6 data set with 36 
cells shown here, the Ising model and RBM both provide a good 
fit to the distribution of synchrony in the observed data. 

Note that any error in estimating the partition function of the 
models would lead to a vertical offset of all points. Thus visually 
checking the alignment of the data cloud around the identity line 
provides a visual verification that there are no catastrophic errors 
in the estimation of the partition function. Unfortunately we 
cannot use this alignment as a shortcut to compute the partition 
function without sampling, e.g. by defining Z such that the all 
zeros state has the correct frequency, as this assumes a perfect 
model fit. For instance, Li regularization tends to reduce model 
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a) Network only 
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Figure 3. Model comparison using likelihood gain over the 
independent model. Likelihoods are normalized to bits/spike to 
account for different population size as well as firing rate, (a) The change 
in performance with dataset size (22 cells for session B4, 26 cells for MU, 
and 36 cells for T6) is thus due to additional structure captured from 
larger populations. B4 and T6 are spike sorted, B4IV1 is a multiunit dataset. 
All three models outperform the independent model by 0.4-0.8 bits/ 
spike. The higher-order models with hidden units give a small (0.03-0.04 
bits/spike, about 10%) improvement over the Ising model for the small 
datasets, growing to 0.18 bits/spike, about 28%, for the dataset with the 
large population size, (b) Including stimulus terms provides a large gain 
in likelihood, even the stimulus PSTH term alone outperforms the 
network models by a large margin for this dataset. There is still a 
significant gain by including coupling terms. The difference between the 
second order Ising model and higher-order RBM is particularly visible in 
the right hand plot which shows the gain relative to the PSTH only 
model. All error bars indicate one standard deviation over repeated 
estimation on different random subsets of the data for training and 
validation and random initializations of AIS estimation. 
doi:1 0.1 371 /journal.pcbi.1 003684.g003 

probabilities of the most frequent states, so this estimate of Z 
would systematically overestimate the likelihood of regularized 
models. We note, however, that for higher-order models with no 
regularization this estimate does indeed agree well with the AIS 
estimate. 



Spatiotennporal nnodels 

The same models can be used to capture spatiotemporal 
patterns by treating previous time steps as additional cells. 
Consecutive network states binned at 6.7 ms were concatenated 
in blocks of up to 13 time steps, for a total network dimensionality 
of 130 with 10 cells. These models were cross-validated and the 
sparseness parameters optimized in the same way as for the 
instantaneous model. This allows us to learn kernels that describe 
the temporal structure of interactions between cells. 

In Fig. 5 we compare the relative performance of spatiotem- 
poral Ising and higher-order models as a function of the number of 
time steps included in the model. To create the datasets, we picked 
a subset of 1 0 cells with the highest firing rates from the B4 dataset 
(4 cells from subgranular, 2 from granuar and 4 from super- 
granular layers) and concatenated blocks of up to 13 subsequent 
data vectors. This way models of any dimensionality divisible by 
1 0 can be estimated. The number of parameters of the RBM and 
Ising model were kept the same by fixing the number of hidden 
units in the RBM to be equal to the number of visible units, the 
sRBM was also estimated with a square weight matrix for the 
hidden layer. As before, the higher-order models consistently 
outperform the Ising model. The likelihood per spike increases 
with the network size for all models, as additional information 
from network interactions leads to an improvement in the 
predictive power of the model. The curve for the Ising model 
levels off after a dimensionality of about 30 is reached, as higher- 
order structure that is not well captured by pairwise coupling 
becomes increasingly important. However, the likelihood of 
higher-order models continues to increase through the entire 
experimental range. 

The insert in the figure shows the entropy of the models, 
normalized by the data dimensionality by dividing by the number 
of frames and neurons. The entropy was computed as 
5'= -<log(/?(x))>^(x) = <£'(x)>^(x)+log(Z) where the expecta- 
tion of the energy was estimated by initializing 100,000 samples 
using the holdout data set, and then running 2000 steps of Gibbs 
sampling. Due to temporal dependencies additional frames carry 
less entropy, but we do not reach the point of extensivity where the 
additional entropy per frame reaches a constant value. As the 
RBM is better able to explain additional structure in new frames. 




Figure 4. Scatter plot showing empirical probabilities against model probabilities on T6 test dataset. Different models are distinguished 
by color, the number of simultaneously spiking cells in each pattern by different symbols, (a) shows the independent model compared to the Ising 
model, (b) shows the Ising model compared to the RBM. The sRBM is omitted as it is very similar to the RBM. The RBM significantly outperforms 
simpler models. Inserts in both models show the distribution of synchrony, P{K), where K is the number of cells simultaneously firing in one time 
bin. The synchrony in the empirical distribution (red line) is greatly underestimated by the independent model (yellow), but well captured by both 
the pairwise (blue) and higher-order model (green). 
doi:1 0.1 371 /journal.pcbi.1 003684.g004 
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Figure 5. Likelihood comparison as a function of model size. 

Spatiotemporal models with 10 cells and a varying number of 
concatenated time steps. The log-likelihood per spike increases as 
each neuron is modeled as part of a longer time sequence. This effect 
holds both for Ising and higher-order models. Since the Ising model 
cannot capture many of the relevant dependencies, the increase in 
likelihood saturates after about 3 timesteps for the Ising model, but 
continues to increase for the higher-order models. Insert: Comparison 
of the entropy per time slice for Ising and RBM models as a function of 
model size. As the RBM is better able to model spatiotemporal 
dependencies, the additional entropy for extra frames is smaller than 
for the Ising models. The RBM does not reach the point of extensivity, 
where additional frames add constant entropy. Multiple lines of the 
same color indicate repeated runs with different random initialization. 
doi:1 0.1 371 /journal.pcbi.1 003684.g005 

the additional entropy for new frames is much less than for the 
Ising model. 

A similar observation has been made in [5], where Ising and 
higher-order models for 100 retinal ganglion cells were compared 
to models for 10 time steps of 10 cells. It is noteworthy that 
temporal dependencies are similar to dependencies between 
different cells, in that there are strong higher-order correlations 
not well described by pairwise couplings. These dependencies 
extend surprisingly far across time (at least 87 ms, corresponding 
to the largest models estimated here) and are of such a form that 
including pairwise couplings to these states does not increase the 
likelihood of the model. This has implications e.g. for GLMs that 
are typically estimated with linear spike coupling kernels which 
will likely miss these interactions. 

To predict spiking based on the network history, we can 
compute the conditional distribution of single units given the state 
of the rest of the network. This is illustrated for a network with 1 5 
time steps for a dimensionality of 150. This model is not included 
in the above likelihood comparison, as the AIS normalization 
becomes very expensive for this model size. Fig. 6a) shows the 
learned weights of 1 8 randomly chosen nonzero hidden units for a 
spatiotemporal RBM model with 150 hidden units. Each subplot 
corresponds to one hidden unit, which connects to 10 neurons 
(vertical axis) across 15 time steps or 100 ms (horizontal axis). 
Some units specialize in spatial coupling across different cells at a 
constant time lag. As the model has no explicit notion of time, the 
time lag of these spatial couplings is not unique and the model 
learns multiple copies of the same spatial pattern. Thus while there 
are 55 nonzero hidden units, the number of unique patterns is 
much smaller so that the effective representation is quite sparse. 
The remaining units describe smooth, long-range temporal 
dependencies, typically for small groups of cells. Both of these 



subpopulations capture higher-order structure connecting many 
neurons that cannot be well approximated with pairwise 
couplings. 

By conditioning the probability of one cell at one time bin on 
the state of the remaining network, we can compute how much 
information about a cell is captured by the model over a naive 
prediction based on the firing rate of the cell. This conditional 
likelihood for each cell is plotted in Fig. 6b) in a similar way to 
excess log likelihood for the entire population in Fig. 5, except in 
units of bits per second rather than bits per spike. While the result 
here reflects our previous observation that Boltzmann machines 
with hidden units outperform Ising models, we note that the 
conditional probabilities are easily normalized in closed form since 
they describe a one-dimensional state space. Thus we can ensure 
that the likelihood gain holds independent of the estimation of Z 
and is not due to systematic errors in sampling from the high- 
dimensional models. Fig. 6c) provides a more intuitive look at the 
prediction. For 1 s of data from one cell, where 5 spikes occur, we 
show the conditional firing probabilities for the three models given 
100 ms of history of itself and the other cells. Qualitatively, the 
models perform well in predicting spiking probabilities, suggesting 
it might compare favorably to prediction based on GLMs or Ising 
models [19]. 

Discussion 

Contributions 

While there has been a resurgence of interest in Ising-like 
maximum entropy models for describing neural data, progress has 
been hampered mainly by two problems. First, estimation of 
energy based models is difficult since these models cannot be 
normalized in closed form. Evaluating the likelihood of the model 
thus requires approximations or a numerical integral over the 
exponential number of states of the model, making maximum 
likelihood estimation computationally intractable. Even the 
pairwise Ising model is typically intractable to estimate, and 
various approximations are required to overcome this problem. 
Second, the number of model parameters to be estimated grows 
very rapidly with neural population size. If correlations up to n^^ 
order are considered, the number of parameters is proportional to 
[number of neurons]". In general, fully describing the distribution 
over states requires a number of parameters which is exponential 
in the number of neurons. This can be dealt with by cutting off 
dependencies at some low order, by estimating only a small 
number of higher-order coupling terms, or by imposing some 
specific form on the dependencies. 

We attempted to address both of these problems here. 
Parameter estimation was made tractable using MPF, and latent 
variables were shown to be an effective way of capturing high 
order dependencies. This addresses several shortcomings that have 
been identified with the Ising model. 

Shortcomings of Ising models 

As argued in [20], models with direct (pairwise) couplings are 
not well suited to model data recorded from cortical networks. 
Since only a tiny fraction of the neurons making up the circuit are 
recorded, most input is likely to be common input to many of the 
recorded cells rather than direct synapses between them. While 
this work compares Generalized Linear Models (GLMs) such as 
models of the retina [21] and for LGN [22] to linear dynamical 
systems (EDS) models, the argument applies equally for the models 
presented here. 

Another shortcoming of the Ising model and some previous 
extensions is that the number of parameters to be estimated does 
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Figure 6. Spatiotemporal network models, (a) Hidden units of spatiotemporal sRBM model. For each hidden unit, the horizontal axis is time and 
the vertical axis cells with horizontal bars separating the subgranular, granular and supergranular cortical layers, (b) Log-likelihood gain for each of 
the 10 cells (ordered by firing rate) conditioned on the remainder of the network state for all three models, (c) Spike prediction from network history. 
For one of the cells, we show 1 s of predicted activity given the history of the network state. In each case when a spike occurs in the data (gray bars), 
there is an elevated probability under the models (colored bars). 
doi:10.1371/journal.pcbi.1003684.g006 



not scale favorably with the dimensionality of the network. The 
number of pairwise coupling terms in GLM and Ising models 
scales with the square of the number of neurons, so with the 
amounts of data typically collected in electrophysiological exper- 
iments it is only possible to identify the parameters for small 
networks with a few tens of cells. This problem is aggravated by 
including higher-order couplings: for example the number of third 
order coupling parameters scales with the cube of the data 
dimensionality. Therefore attempting to estimate these coupling 
parameters directly is a daunting task that usually requires 
approximations and strong regularization. 

Higher-order correlations in small networks 

Early attempts at modeling higher-order structure side-stepped 
these technical issues by focussing on structure in very small 
networks. Ohiorhenuan noted that Ising models fail to explain 
structure in cat visual cortex [8] and was able to model triplet 
correlations [9] by considering very small populations of no more 
than 6 neurons. Similarly, Yu et al. [3,10] show that over the scale 
of adjacent cortical columns of anesthetized cat visual cortex, small 
subnetworks of 10 cells are better characterized with a dichoto- 
mized Gaussian model than the pairwise maximum entropy 
distribution. While the dichotomized Gaussian [23] is estimated 
only from pairwise statistics, it carries higher-order correlations 
that can be interpreted as common Gaussian inputs [24] . However 
these correlations are implicit in the structure of the model and not 
directly estimated from the data as with the RBM, so it is not clear 
that the model would perform as well on different datasets. 

Modeling large networks 

Given that higher-order correlations are important to include in 
statistical models of neural activity, the question turns to how these 
models can be estimated for larger data sets. In this section, we 
focus on two approaches that are complementary to our model 



using hidden units. The increasing role of higher-order correla- 
tions in larger networks was first observed in [25], where Ising 
models were fit via MCMC methods to the same 40 cell retina 
dataset that was analyzed in terms of subsets of 1 0 cells in [2] . This 
point is further emphasized by Schneidman and Ganmor in [5], 
who caution that trying to model small subsets (10 cells) of a larger 
network to infer properties of the full network may lead to 
incorrect conclusions, and show that for retinal networks higher- 
order correlations start to dominate only once a certain network 
size is reached. 

Therefore they address the same question as the present paper, 
i.e. how to capture n^^ order correlations without the accompa- 
nying d'^ growth in the number of free parameters in a larger 
network. In their proposed Reliable Interaction Model (RIM), 
they exploit the sparseness of the neural firing patterns to argue 
that it is possible to explicitly include third, forth and higher-order 
terms in the distribution, as most higher-order coupling terms will 
be zero. Therefore the true distribution can be well approximated 
from a small number of these terms, which can be calculated using 
a simple recursive scheme. In practice, the main caveat is that only 
patterns that appear in the data many times are used to calculate 
the coupling terms. While the model by construction assigns 
correct relative probabilities to observed patterns, the probability 
assigned to unobserved patterns is unconstrained, and most of the 
RIM's probability mass may thus be assigned to states which never 
occur in the data. 

The second alternative to the RBM with hidden units is to 
include additional low-dimensional constraints in an Ising model. 
In the "K-pairwise" model [18,26], in addition to constraining 
pairwise correlations, a term is introduced to constrain the 
probability of k neurons being active simultaneously. This adds 
very little model complexity, but significantly improves the model 
of the data. This is shown, for example, by computing conditional 
predictions in a similar fashion to that shown in Fig. 6c), where the 
K-pairwise model for a population of 100 retinal ganglion cells has 
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an 80% correlation with the repeated trial PSTH. In contrast to 
the RBM, however, this model is not structured in a way that can 
be easily interpreted in terms of functional connectivity. To 
estimate these models for large numbers (^>100) of neurons, the 
authors leverage the sampling algorithm described in [27], an Li- 
regularized histogram Monte Carlo method. 

In addition to proposing a faster (though slower than MPF) 
parameter estimation method for this class of models, Tkacik and 
colleagues address the difficulty in sampling from the model and 
computing the partition function. In our experiments the overall 
limiting factor is the Gibbs sampler in the AIS partition function 
estimation. Tkacik et al. use a more efficient sampling algorithm 
(Wang-Landau) to compute partition functions and entropy of 
their models. As an even simpler approach to the partition 
function problem, they suggest that it can be obtained in closed 
form if the empirical probability of at least one pattern in the data 
is accurately known. A case in point is the all zeros pattern that is 
typically frequent for recordings with sparsely firing neurons. 
Unfortunately, this approach is limited in that it assumes that the 
probability the model assigns to the state is identical to the 
empirical probability of the state. In the case that the model has 
not been perfectly fit to the data, or in the case that the data does 
not belong to the model class, this wUl lead to an incorrect estimate 
of the partition function. 

Modeling stimulus-driven connponents 

Since the activity we are modeling is in response to a specific 
stimulus, one may rightfully question whether the observed higher- 
order correlations in neural activity are simply due to higher-order 
structure contained in the stimulus, as opposed to being an 
emergent property of cortical networks. In an attempt to tease 
apart the contribution of the stimulus, we included a nonpara- 
metric PSTH term in the model. However, this can capture 
arbitrarily complex stimulus transformations using the trial- 
averaged response to predict the response to a new repetition of 
the same stimulus. As an "oracle model", it does not only capture 
the part of the response that could be attributed to a feed-forward 
receptive field, but also captures contextual modulation effects 
mediated by surrounding columns and feedback from higher brain 
areas, essentially making it "too good" as a stimulus model. The 
RBM and Ising models are then relegated to merely explain the 
trial to trial variability in our experiments. Not including stimulus 
terms and finding the best model to explain the correlations 
present in the data, irrespective of whether they are due to 
stimulus or correlated variability, seems to be an equally valid 
approach to discover functional connectivity in the population. 

Relation to GLMs 

GLMs [21] can be used to model each cell conditioned on the 
rest of the population. While mostly used for stimulus response 
models including stimulus terms, they are easily extended with 
terms for cross-spike coupling, which capture interactions between 
cells. GLMs have been successfully augmented with latent 
variables [20], for instance to model the effect of common noisy 
inputs on synchronized firing at fast timescales [28]. A major 
limitation of GLMs is that current implementations can only be 
estimated efficiently if they are linear in the stimulus features and 
network coupling terms, so they are not easily generalized to 
higher-order interactions. Two approaches have been used to 
overcome this limitation for stimulus terms. The GLM can be 
extended with additional nonlinearities, preserving convexity on 
subproblems [22]. Alternatively, the stimulus terms can be 
packaged into nonlinear features which are computed in 
preprocessing and usually come with the penalty of a large 



increase in the dimensionality of the problem [29]. However, we 
are not aware of any work applying either of these ideas to spike 
history rather than stimulus terms. Another noteworthy drawback 
of GLMs is that instantaneous coupling terms cannot be included 
[20], so instantaneous correlations cannot be modeled and have to 
be approximated using very fine temporal discretization. 

Conclusion 

The RBM provides a parsimonious model for higher-order 
dependencies in neural population data. Without explicitly 
enumerating a potentially exponential number of coupling terms 
or being constrained by only measurements of pairwise correla- 
tions, it provides a low-dimensional, physiologically interpretable 
model that can be easily estimated for populations of 1 00 or more 
cells. 

The connectivity patterns the RBM learns from cells simulta- 
neously recorded from all cortical layers are spatially localized, 
showing that small neural assemblies within cortical layers are 
strongly coupled. This suggests that cells within a layer perform 
similar computations on common input, while cells across different 
cortical layers participate in distinct computations and have much 
less coupled activity. This novel observation is made possible by 
the RBM: because each of the hidden units responds to (and 
therefore learns on) a large number of recorded patterns, it can 
capture dependencies that are too weak to extract with previous 
models. In particular, the connectivity patterns discovered by the 
RBM and sRBM are by no means obvious from the covariance of 
the data or by inspecting the coupling matrix of the Ising model. 
This approach, combining a straightforward estimation procedure 
and a powerful model, can be extended from polytrode recordings 
to capture physiologically meaningful connectivity patterns in 
other types of multi-electrode data. 

Materials and Methods 

Recording and experimental procedures 

Ethics statement. The protocol used in the experiments was 
approved by the Institutional Animal Care and Use Committee at 
Montana State University and conformed to the guidelines 
recommended in Preparation and Maintenance of Higher 
Mammals During Neuroscience Experiments, National Institutes 
of Health Publication 913207 (National Institutes of Heath, 
Bethesda, MD 1991). 

Visual stimuli. Three movies of 8, 20 and 30 minutes 
duration were captured at 300 frames per second and 512 x 384 
pixel resolution with a Casio Fl camera. All movies were recorded 
on the campus of MSU Bozeman and depicted natural scenes such 
as ducks swimming on a pond, selected to contain heterogeneous 
motion across the scene. The camera was mounted on a tripod to 
avoid camera motion and there were no scene changes within 
movies, so the spatiotemporal statistics of the movie do not contain 
artifacts from camera motion or cuts. Movies were converted to 
grayscale with no additional contrast normalization on top of the 
in-camera processing, and temporally down-sampled to 150 Hz 
for presentation. The high frame rate was chosen to prevent cells 
phase locking to the frame rate, and scene changes were 
minimized to avoid evoked potentials due to sudden luminance 
changes. The movies were presented at 150 frames per second on 
a 2 1" CRT monitor that was calibrated for a linear response. Each 
of the three movies was presented once and a 30 s segment of the 
first movie was presented for 60 repeated trials. 

Recording methods. Data were recorded from anesthetized 
cat visual cortex in response to a custom set of full field natural 
movie stimuli. The surgical methods are described in detail 
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elsewhere [30] . In brief, the anesthetized animals were mounted in 
a stereotaxic frame and a small craniotomy was made over area 
1 7. The dura was reflected and agar in artificial cerebrospinal fluid 
was applied to protect the cortical surface and reduce pulsations. 
The polytrode was slowly lowered perpendicularly into cortex 
using a hydraulic microdrive. Recordings were made with single 
shank 32 channel polytrodes (Neuronexus Technologies Al x32) 
with a channel spacing of 50|im, contact diameter of 23|im and 
thickness of 15|im, spanning all the layers of visual cortex. 
Individual datasets had on the order of 20 to 40 simultaneously 
recorded neurons. 

For spike sorting, the 32 polytrode channels were treated as 8 
non-overlapping groups, shown in Fig. S 2a). To each group, a 
standard tetrode spike sorting method was applied [30]: In brief, 
spike waveforms were extracted at a threshold of 6a and features 
(area, energy, peak, valley, peak valley ratio, width, trigger value 
and the first three principal components) computed for each 
channel in the group. In this feature space the data was clustered 
using k-means clustering (KlustaKwik, [31]) with a manual 
cleanup using the MClust Matlab package [32]. See Fig. S 2b) 
for cluster isolation for an example channel group. Cleanup 
consisted of identifying and removing artifacts, and merging 
clusters belonging to the same cell. Clusters were then labeled as 
either single- or multi-unit activity based on the following 
diagnostics: Inspection of distribution of waveform amplitudes 
on all four channels to ensure clusters were well-separated and no 
spikes were missing due to the thresholding; inspection of cross- 
correlograms to identify cells that appeared on two neighboring 
tetrode groups or triggered on two channels within a group; 
inspection of inter-spike interval distributions to identify multiunit 
activity based on intervals shorter than 1 ms and thus below the 
refractory period of a single cell; inspection of raw waveforms and 
the time-course of waveform amplitudes. Clusters that passed all 
criteria were added to the single unit data set, for which ISI 
histograms are shown in Fig. S 2c), remaining spikes were 
considered multi-unit activity. Unless noted otherwise, spikes for 
all data sets were binned at 20 ms where bins with a single spike 
(2.4% of bins) and multiple spikes (0.9% of bins) were both treated 
as spiking and the rest as non-spiking. The bin size was chosen to 
span the width of the central peak in cross-correlograms between 



pairs of cells such as shown in Fig. 7c), where the dashed vertical 
lines at + 20ms envelope the central peak. 

Histological procedures. To register individual recording 
channels with cortical layers, recording locations were recon- 
structed from Nissl-stained histological sections, and current 
source density (CSD) analysis in response to 100 repetitions of a 
full-field stimulus flashed at 1 Hz was used to infer the location of 
cortical layer IV on the polytrode [33]. Spike-sorted units were 
assigned to layers based on the channel with the largest amplitude. 
In Fig. 7a) we use horizontal lines to indicate cortical layer 
boundaries and their position relative to the polytrode. Fig. 7b) 
shows the CSD response for this session, showing a strong current 
sink in layer IV and to a lesser degree in layer VI. 

Datasets. The models were estimated on a data set of 
180,000 data vectors corresponding to approximately 60 minutes 
of recording time. The data were split into two subsets of 90,000 
by random assignment of data points: a training set for parameter 
estimation and a test set to compute cross-validated likelihoods. 

For the stimulus dependent model, a separate data set of 80,000 
data vectors was used, taken from a 30 s long movie that was 
repeated for 60 presentations, and again split into a training and 
test set of equal size. No separate validation set was used and the 
hyper-parameter was selected directly on the test set. 

We also analyzed spatiotemporal patterns of data, which were 
created by concatenating consecutive state vectors. For the 
spatiotemporal experiments a bin width of 6.7 ms, corresponding 
to the frame rate of the stimulus, was used. This bin size is a 
compromise capturing more detailed structure in the data without 
leading to an undue increase in dimensionality and complexity. 
Up to 13 time bins were concatenated in order to discover 
spatiotemporal patterns and predict spiking given the history over 
the prior 87 ms. These models were trained on 100,000 samples 
and likelihoods were computed on a set of equal size. 

Model and estimation 

The sRBM consists of a set of binary visible units x e {0,1}^ 
corresponding to observed neurons in the data and a set of hidden 
units h G {0, 1 }^ that capture higher-order dependencies. Weights 
between visible units, corresponding to an Ising model or fully 



a) Histological slice b) Current Source Density c) Temporal cross-correlation 




25 30 35 ms Lag / ms 

Figure 7. Experimental methods, (a) Example recording site with lines indicating layer boundaries and a schematic drawing of the 32-channel 
probe, (b) Current source density (the second spatial derivative of the LFP) in response to a full-field flash stimulus, showing a strong current sink in 
layer IV and to a lesser degree in layer VI. CSD was used to assign layer boundaries, (c) Cross-correlation for one pair of cells across time, binned at 
6.7 ms. Correlations fall off quickly as the time lag increases with a central peak extending +20ms indicated by dashed lines. 
doi:1 0.1 371 /journal.pcbi.1 003684.g007 
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visible Boltzmann machine, capture pairwise couplings in the data. 
Weights between visible and hidden units, corresponding to an 
RBM, learn to describe higher-order structure. 

The Ising model with visible-visible coupling weights J e IZ^ ^ ^ 
and biases b G IZ^ has an energy function 

£'i(x)=-x^Jx-b^x, (1) 

with associated probability distribution />i(x) = ■^exp[ — £'i(x)], 
where the normalization constant, or partition function, Z\ = 
exp[ — £'i(x')] consists of a sum over all 2^ system states. 

The RBM with visible-hidden coupling weights We7^^''^ 
and hidden and visible biases by G IZ^ and b/j e IZ^ has an energy 
function 

£R(x,h)=-x^Wh-b,^x-b[h, (2) 

with associated probability distribution /77?(x,h) = -^exp 

[ — £'R(x,h)]. Since the there are no connections between hidden 
units (hence "restricted" Boltzmann machine), the latent variables 
h can be analytically marginalized out of the distribution (see 
supplementary information) to obtain 

p(x) = I rfh/,(x,h) = ^ exp[ - £r(x)] . (3) 

This step gives a standard energy-based model which we can 
estimate in our framework, while in a fully connected Boltzmann 
machine we could not marginalize over hidden units, making the 
estimation intractable. The energy for the marginalized distribu- 
tion over X (sometimes referred to as the free energy in machine 
learning literature) is 

£r(x) = - ^ log(l + e^f^+^'-.O - b,^x, (4) 



where w^- is the / row of the matrix W. The energy function for 
an sRBM combines the Ising model and RBM energy terms, 

Es (x,h) = - x^ Jx - x^Wh - b,^x - bjh. (5) 



peri-stimulus time histogram (PSTH), i.e. the response to a given 
stimulus obtained by empirically computing the firing probabilities 
averaged over repeated presentations. This non-parametric model 
has the form 

£H(x)=-bJx (8) 

where the subscript H refers to histogram, and both the dynamic 
bias term bj and the data vector x have explicit time dependence. 
The dynamic bias terms in this model can be computed in closed 

form /?,(0 = log^— ^ where the PSTH p{t)= J2xi(t)/N sums 

over stimulus repetitions and counts the number of spikes fired by 
the neuron. Starting with this model and keeping the dynamic bias 
term fixed at the closed form solution, we add pairwise and higher- 
order coupling terms to obtain 

Ejh(x) = - x^ Jx - b^x - bjx (9) 



ERH(x,h)= -x^Wh-b[x-b[h-bJx, (10) 

which combine the Ising and RBM population models, respec- 
tively, with the stimulus model given by the PSTH term. 
Estimation of these models is no more difficult than the standard 
Ising and RBM models, since the PSTH term is computed in 
closed form and effectively only adds a constant to the energy 
function. This method corresponds to model T2 (explicit time 
dependence, second order stimulus dependence) of [36], where it 
is not further explored due to the requirement for repeated 
stimulus segments. We used 60 repetitions of a 30 s long natural 
movie to estimate the PSTH. 

Instead of CD, which is based on sampling, we train the models 
using Minimum Probability Flow (MPF, [13]), a recently 
developed parameter estimation method for energy based models. 
MPF works by minimizing the KL divergence between the data 
distribution and the distribution which results from moving slightly 
away from the data distribution towards the model distribution. 
This KL divergence will be uniquely zero in the case where the 
model distribution is identical to the data distribution. While CD is 
a stochastic heuristic for parameter update, MPF provides a 
deterministic and easy to evaluate objective function. Second 
order gradient methods can therefore be used to speed up 
optimization considerably. The MPF objective function 



As with the RBM, it is straightforward to marginalize over the 
hidden units for an sRBM, 



;7s(x) = ^exp[-Es(x)], (6) 



Es(x)=-x^Jx- ^^logCl+^^^'^+^/^^O-brx. (7) 



A hierarchical Markov Random Field based on the sRBM has 
previously been applied as a model for natural image patches [34] , 
with the parameters estimated using contrastive divergence (CD) 
[35]. 

To include stimulus effects into the Boltzmann machine models, 
we start with a maximum entropy model constrained to fit the 



(11) 



measures the flow of probability out of data states x into 
neighboring non-data states x', where the connectivity function 
g(x,x')G{0,l} identifies neighboring states, and 7) is the list of data 
states. We consider the case where the connectivity function 
g(x,x') is set to connect all states which differ by a single bit flip 



g(x,x') 



1 //(x,x') = 1 
0 otherwise 



(12) 



where H(x,x') is the Hamming distance between x and x'. See 
supplementary information for a derivation of the MPF objective 
function and gradients for the sRBM, RBM, and Ising models. In 
all experiments, minimization of K was performed with the 
MinFunc implementation of L-BFGS [37]. 
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To prevent overfitting all models were estimated with an Li 
sparseness penalty on the coupling parameters. This was done by 
adding a term of the form /I ^ \ Jij\ + X ^ | Wjk\ to the objective 
function, summing over the absolute values of the elements of both 
the visible and hidden weight matrices. The optimal sparseness X 
was chosen by cross-validating the log-likelihood on a holdout set, 
with the value of /I selected from the set A = [0,1,2,4,6,8,10] x 10~^ 
spanning the range of optimal regularization for all models. 

Since MPF learning does not give an estimate of the partition 
function, for models that were too large to normalize by summing 
over all states, we use annealed importance sampling (AIS, [16]) to 
compute normalized probabilities. AIS is a sequential Monte 
Carlo method that works by gradually morphing a distribution 
with a known normalization constant (in our case a uniform 
distribution over x) into the distribution of interest. See 
supplementary information for more detail. AIS applied to RBM 
models is described in [38], which also highlights the shortcomings 
of previously used deterministic approximations. Since AIS 
requires running a sampling chain, in our case a Gibbs sampler, 
it generally takes much longer than the parameter estimation. 

Normalizing the distribution via AIS allows us to compute the 
log likelihood of the model /?m and compare it to the likelihood 
gained over a baseline model. This baseline assumes cells to be 
independent and characterized by their firing rate 
/7r(x)= Iii{riXi-\-{\—ri){\—Xi)) with rates for individual cells 
/. The independent model is easily estimated and normalized, and 
is commonly used as a reference for model comparison. The excess 
log likelihood over this baseline is defined in terms of a sample 
expectation as Y^^eV [logi/^mCx) -log2/?r(x)]. 

The excess log likelihood rate computed in bits/spike by 
normalizing with the population firing rate per time bin, is used as 
the basis for model comparisons. Normalizing by the firing rate 
and comparing bits/spike rather than bits per unit time has the 
advantage that this measure is less sensitive to the overall activity 
when comparing across data sets. For models with stimulus terms, 
a separate normalization constant is required for each state of the 
dynamic bias. As our 30 s stimulus segment data set consists of 
1500 time bins, this amounts to computing 1500 separate partition 
functions. This limits us to small models with up to 20 neurons 
where the partition function can be computed quickly by 
enumerating the full state space, avoiding the use of the costly 
AIS sampling procedure. To compare likelihoods, we use the 
PSTH model (i.e. the model with only stimulus and no coupling 
terms) as the baseline since all three likelihoods are much higher 
than for models without stimulus terms. 

Supporting Information 

Figure SI Monitoring the convergence of the AIS 
estimate for the partition function. Example shows a 20- 
dimensional Ising model. Each entry on the horizontal axis 
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